Solution: In No Particular Order
Let's test some techniques for random selection in this lesson to solve the antipattern.
The sort-by-random technique is an example of a query that’s bound to perform a table scan and an expensive manual sort. When we design solutions in SQL, we should be on the lookout for inefficient queries like this. Instead of searching fruitlessly for a way to optimize an unoptimizable query, we need to rethink our approach. We can use the alternative techniques shown in the following sections to query a random row from a query result set. In different circumstances, each of these solutions can produce the same result with greater efficiency than sort-by-random.
Choose a random key-value between 1 and MAX#
One technique that avoids sorting the table is to choose a random value between 1 and the greatest primary key value.
This solution assumes that:
- the primary key values start at 1
- the primary key values are contiguous, that is, there are no values unused between 1 and the greatest value. If there are gaps, a randomly chosen value may not match a row in the table.
We use this solution when we know that our key uses all values between 1 and the greatest key value.
Choose the next highest key value#
This is similar to the preceding solution, but if we have gaps of unused values between 1 and the greatest key value, this query matches a random value to the first key value it finds.
This solves the problem of a random number that misses any key value, but it also means that a key value that follows a gap is chosen more often. Random values should be approximately even in distribution, but bug_id
values aren’t.
We use this solution when gaps are uncommon and when it’s not important for all key values to be chosen with equal frequency.
Get a list of all key values, choose one at random#
We can also use the application code to pick one value from the primary keys in the result set. Then, we can query the full row from the database using that primary key. This technique is shown in the following PHP code:
<?php
$bug_id_list = $pdo->query("SELECT bug_id FROM Bugs")->fetchAll();
$rand = random( count($bug_id_list) );
$rand_bug_id = $bug_id_list[$rand]["bug_id"];
$stmt = $pdo->prepare("SELECT * FROM Bugs WHERE bug_id = ?"); $stmt->execute( array($rand_bug_id) );
$rand_bug = $stmt->fetch();
?>
This avoids sorting the table, and the chance of choosing each key value is approximately equal, but this solution has other costs:
- Fetching all the
bug_id
values from the database might return a list of impractical sizes. It can even exceed application memory resources and cause an error such as the following:
Fatal error: The allowed memory size of 16777216 bytes exhausted.
- The query must be run twice: once to produce the list of primary keys and a second time to fetch the random row. If the query is too complex and costly, this is a problem.
We can use this solution when we’re selecting a random row from a simple query with a moderately sized result set. This solution is good for choosing from a list of noncontiguous values.
Choose a random row using an offset#
Still another technique that avoids the problems found in the preceding alternatives is to count the rows in the data set and return a random number between 0 and the count. Then, we can use this number as an offset when querying the data set.
<?php
$rand = "SELECT ROUND(RAND() * (SELECT COUNT(*) FROM Bugs))";
$offset = $pdo->query($rand)->fetch(PDO::FETCH_ASSOC);
$sql = "SELECT * FROM Bugs LIMIT 1 OFFSET :offset";
$stmt = $pdo->prepare($sql);
$stmt->execute( $offset );
$rand_bug = $stmt->fetch();
?>
This solution relies on the non-standard LIMIT
clause, supported by MySQL, PostgreSQL, and SQLite.
An alternative that uses the ROW_NUMBER()
window function works in Oracle, Microsoft SQL Server, and IBM DB2.
For example, here’s the solution in Oracle:
<?php
$rand = "SELECT 1 + MOD(ABS(dbms_random.random()), (SELECT COUNT(*) FROM Bugs)) AS offset FROM dual";
$offset = $pdo->query($rand)->fetch(PDO::FETCH_ASSOC);
$sql = "WITH NumberedBugs AS ( SELECT b.*, ROW_NUMBER() OVER (ORDER BY bug_id) AS RN FROM Bugs b ) SELECT * FROM NumberedBugs WHERE RN = :offset";
$stmt = $pdo->prepare($sql);
$stmt->execute( $offset );
$rand_bug = $stmt->fetch();
?>
We can use this solution when we can’t assume contiguous key values, and we need to make sure each row has an even chance of being selected.
Proprietary solutions#
Any given brand of database might implement its own solution for this kind of task. For example, Microsoft SQL Server 2005 added a TABLESAMPLE
clause:
Oracle uses a slightly different SAMPLE
clause, for example, to return 1 percent of the rows in the table:
We should read the documentation for the appropriate solution in our brand of database. There are often limitations or other options that we need to know about.